Genome Medicine — Latest Matching Preprints

1

Standardized transcriptome analysis improves rare disease diagnosis in the pan-European Solve-RD consortium

Yepez, V. A.; Luknarova, R.; Beijer, D.; Estevez-Arias, B.; Mei, D.; Morsy, H.; Mueller, J. S.; Polavarapu, K.; Demidov, G.; Doornbos, C.; Ellwanger, K.; Krass, L.; Laurie, S.; Matalonga, L.; Abdelrazek, I. M.; Astuti, G.; Bisulli, F.; Brechtmann, F.; Dabad, M.; Denomme Pichon, A. S.; Drakos, M.; Eddafir, Z.; Garrabou, G.; Guerrini, R.; Johari, M.; Kegele, J.; Kilicarslan, O. A.; Koelbel, H.; Kolen, I. H. M.; Licchetta, L.; Lochmueller, H.; Maassen, K.; Macken, W.; Mertes, C.; Milisenda, J. C.; Minardi, R.; Mostacci, B.; Neveling, K.; Oud, M. M.; Park, J.; Pujol, A.; Roos, A.; Sagath, L.; van

2026-02-14 genetic and genomic medicine 10.64898/2026.02.10.26345954 medRxiv

Top 0.1%

67.0%


RNA sequencing (RNA-seq) provides a powerful complement to DNA sequencing for uncovering pathogenic defects affecting gene expression and splicing in individuals with genetically undiagnosed rare disorders. However, as large rare disease consortia adopt RNA-seq, challenges arise due to cohort heterogeneity, variability in tissues and sample sizes, and differences in interpretation practices. Here, we present a harmonized analytical and interpretation framework developed by the pan-European Solve-RD consortium to address these challenges. We analyzed 521 RNA-seq samples from whole blood, fibroblasts, muscle and peripheral blood mononuclear cells collected across more than 30 clinics and five European Reference Networks. Aberrant expression and splicing events were identified using OUTRIDER and FRASER 2.0 and analyzed through a standardized four-level scoring framework that encompassed RNA-seq outlier reliability, phenotype relevance, variant mechanism, and segregation evidence, captured in structured reports for interpretation. Regular meetings and collaborative "Solvathon" workshops were used to evaluate variant pathogenicity. This effort resulted in molecular diagnoses for 19 families out of 248 (7.7%) for whom DNA analyses had been inconclusive. Furthermore, three cases diagnosed using DNA analyses were confirmed, and 49 candidate events and five novel candidate disease genes were identified in the remaining families. Our results demonstrate the feasibility and impact of large-scale, standardized RNA-seq analysis in a transnational research setting. This framework provides a model for other international initiatives such as the Undiagnosed Diseases Network and ERDERA, paving the way for broader clinical implementation of transcriptome-based rare disease diagnostics.

2

Complete pharmacogenomic profile from exome sequencing

Bensouna, I.; Grujic, A.; Ponce, F.; Jauniaux, N.; Scheikl, T.; Picard, N.; Chaumette, B.; Hatz, K.-D.; Vanhoye, X.; Mesnard, L.; Raymond, L.

2026-01-19 genetic and genomic medicine 10.64898/2026.01.13.26343772 medRxiv

Top 0.1%

54.2%


Exome sequencing (ES) is a cornerstone of clinical genetic diagnosis, yet its application in pharmacogenomics remains limited. While some pharmacogenetic variants are detectable by ES, clinically relevant loci such as CYP2D6, UGT1A1, and HLA remain challenging. We present a robust, comprehensive method to derive a complete pharmacogenomic profile directly from standard ES data. Our method addresses primary limitations of ES for pharmacogenomics, including low coverage and structural complexity at critical loci. We analyzed 66 samples from diverse sources, targeting 217 variants across a panel of 23 pharmacogenes. The method was validated by comparing its results with reference samples from the Genetic Testing Reference Material Coordination Program, as well as with the Veridose Core+CNV assay® (Agena) and the Personal Medicine Profile assay® (GeneTelligence). HLA typing performance was assessed and confirmed through comparison with both the Immucor LIFECODES HLA-SSO kit and a clinical transplantation-grade HLA assay. This validation demonstrates that ES can provide a comprehensive pharmacogenomic profile in a single, streamlined workflow, facilitating seamless integration of pharmacogenomics into precision medicine.

3

The impact of systematized generation, evaluation, and incorporation of machine learning algorithms for clinical variant classification

Fresard, L.; Facio, F. M.; Chen, E.; Colavin, A.; Johnson, B.; Araya, C.; Manders, T.; Wahl, A.; Metz, H.; Nicoludis, J. M.; Ouyang, K.; Padigepati, S.; Kobayashi, Y.; Reuter, J.; Nykamp, K.

2025-02-06 genetic and genomic medicine 10.1101/2025.02.03.25321356 medRxiv

Top 0.1%

51.6%


Variants of uncertain significance (VUS) pose a significant challenge for those undergoing genetic testing, leading to prolonged uncertainty and inappropriate medical care. VUS rate reduction is critical to fully realize the utility of genetic testing for all populations. With the growth of large-scale biological data sources and modern Machine Learning (ML) techniques, predictive modeling has enormous potential for VUS reduction. For this purpose, we developed the Invitae Evidence Modeling Platform (EMP), with key features designed to maximize the utility and confidence of predictive algorithms for variant classification. First, input data for a new model is curated to correspond to a single major evidence category within a variant classification framework. Second, gene-specific training and/or validation is performed for each model type. Third, accuracy thresholds are set to filter out gene-specific models that do not meet stringent accuracy metrics. Finally, prediction scores for variant pathogenicity are calibrated to ensure internally consistent evidence weighting within the classification framework. The EMP has accelerated the development of ML algorithms and greatly expanded the amount of evidence available for variant classification. EMP evidence has been applied to more than 800,000 variants across 1 million individuals, 42% of which would have been VUS without this evidence. Importantly, definitive classifications (P, LP, LB, B) made with EMP evidence have high prospective concordance (>99%) with ClinVar submissions. Finally, we demonstrate that further use and development of EMP evidence for variant classification has the potential to reduce the VUS disparity across race/ethnicity/ancestry (REA) groups.
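
The third step described above, filtering out gene-specific models that miss an accuracy threshold, can be sketched in a few lines. This is only an illustration under stated assumptions and not the EMP implementation: the `GeneModel` type, the placeholder gene names, the accuracy values, and the 0.95 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    gene: str        # gene the model was trained or validated on (placeholder names)
    accuracy: float  # validation accuracy in [0, 1]; the metric itself is an assumption


def filter_models(models, min_accuracy):
    """Keep only the gene-specific models that meet the accuracy threshold."""
    return [m for m in models if m.accuracy >= min_accuracy]


# Hypothetical models and threshold, for illustration only.
models = [
    GeneModel("GENE_A", 0.99),
    GeneModel("GENE_B", 0.91),
    GeneModel("GENE_C", 0.97),
]
kept = filter_models(models, min_accuracy=0.95)
print([m.gene for m in kept])
```

Filtering per gene rather than per model type means a model family can contribute evidence for the genes where it performs well while being excluded for the genes where it does not.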

4

Combined genomic and phenotypic classification of inherited and acquired genetic disease with long-read sequencing

Fontaine, Y.; Aradhya, S.; Ji, S.; Masle-Farquhar, E.; Bosi, I.; Forkgen, J.; Qiao, Z.; Stevanovski, I.; Reis, A. M.; Rapadas, M.; Suan, D.; Hsu, P.; Wainstein, B.; Hewitt, A. W.; Figtree, G. A.; Deveson, I. W.; Siggs, O. M.

2025-05-09 genetic and genomic medicine 10.1101/2025.05.08.25327265 medRxiv

Top 0.1%

51.4%


Current long read sequencing (LRS) platforms allow the simultaneous detection of both genetic variation and epigenetic modification, yet in most cases only genetic variation is utilised. Here we demonstrate the additional potential utility of methylation-based cell type deconvolution and outlier detection, using LRS data from two different platforms. This approach could reliably estimate the proportions of the most abundant cellular constituents of human blood and peripheral blood mononuclear cells, using either primary samples or synthetic mixtures of data from purified cells. Using samples from patients with a hematological malignancy (B cell chronic lymphocytic leukemia) or immunodeficiency (X-linked agammaglobulinemia), LRS resolved both the disease-associated variant and primary cellular phenotype in a single assay. The LRS approach yielded similar cell proportion estimates to orthogonal scRNAseq data generated on the same samples. LRS-derived methylation data therefore represents an incidental source of phenotypic data, with potential future utility in genomic discovery in population-scale biobanks, and in the investigation of inherited and acquired genetic disease.

5

Unveiling the Landscape of Reportable Genetic Secondary Findings in the Spanish Population: A Comprehensive Analysis Using the Collaborative Spanish Variant Server Database

Carmona, R.; Perez-Florido, J.; Roldan, G.; Loucera, C.; Aquino, V.; Toro-Barrios, N.; Fernandez-Rueda, J. L.; Bostelmann, G.; Lopez-Lopez, D.; Ortuno, F. M.; Morte, B.; CSVS Crowdsourcing Group; Pena-Chilet, M.; Dopazo, J.

2024-08-03 genetic and genomic medicine 10.1101/2024.08.01.24311343 medRxiv

Top 0.1%

49.3%


The escalating adoption of Next Generation Sequencing (NGS) in clinical diagnostics reveals genetic variations, termed secondary findings (SFs), with health implications beyond primary diagnoses. The Collaborative Spanish Variant Server (CSVS), a crowdsourced database, contains genomic data from more than 2100 unrelated Spanish individuals. Following the American College of Medical Genetics and Genomics (ACMG) guidelines, CSVS was analyzed, identifying pathogenic or likely pathogenic variants in 78 actionable genes (ACMG list v3.1) to ascertain SF prevalence in the Spanish population. Among 1129 samples, 60 reportable SFs were found in 5% of individuals, impacting 32 ACMG-listed genes, notably associated with cardiovascular disease (59.4%), cancer (25%), inborn errors of metabolism (6.3%), and other miscellaneous phenotypes (9.4%). The study emphasizes utilizing dynamic population databases for periodic SF assessment, aligning with evolving ACMG recommendations. These findings illuminate the prevalence of significant genetic variants, enriching understanding of secondary findings in the Spanish population.

6

Genomize-SEQ: An NGS data analysis platform for genomic variant classification and prioritization

Kavak, E.; Aslan, T.; Karaman, R.; Aydin, C.; Ozer, T.; Sunnetci Akkoyunlu, D.; Savli, H.; Cine, N.; Seker, T.

2025-09-07 genetic and genomic medicine 10.1101/2025.09.05.25335160 medRxiv

Top 0.1%

48.0%


Accurate interpretation of diverse genetic variants remains a pivotal challenge in the diagnosis of rare diseases. Although evidence-based guidelines established by the American College of Medical Genetics and Genomics have enhanced the precision of variant assessment, the practical implementation of this evidence-based classification can be challenging. The inherent genetic heterogeneity in rare diseases, coupled with the need to integrate information from numerous databases, contributes to this complexity. Therefore, advancements in secondary variant calling, automated variant annotation and prioritization, visualization of variant annotations with the raw data, and a streamlined reporting process are crucial for efficient and robust analysis. Here we present Genomize-SEQ, a web-based clinical genomics analysis software that has all of these capabilities, with which more than 300,000 patients have been analyzed to date. Genomize-SEQ collects data from more than 120 different databases to annotate the variants according to ACMG/AMP guidelines and prioritize the variants that could be causative for the clinical presentation of a patient. Genomize-SEQ can also perform real-time data aggregation to calculate variant frequencies in each center as well as the community. This capability helps clinicians to analyze variants more easily in regions without genome projects or in populations underrepresented in existing databases. We validated the annotation capacity of Genomize-SEQ by performing a systematic comparison of ACMG pathogenicity prediction from widely used algorithms and Genomize-SEQ's algorithm, using ClinGen's expert curation dataset as a truth set. In addition, we tested the prioritization efficiency of Genomize-SEQ by using real-world whole-exome sequencing data of 215 patients with pre-diagnostic and phenotypic information.
Genomize-SEQ identified the causative variants with a 97% success rate, with 52% of these variants ranked in the top position and over 90% ranked within the top 20. Thus, Genomize-SEQ provides a complete solution for comprehensive variant interpretation to achieve fast and reliable diagnosis for rare diseases from next-generation sequencing data.
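
Prioritization results of this kind (share of causal variants ranked first, or within the top 20) reduce to a simple top-k rate over per-case ranks. The sketch below shows that calculation; the function name `top_k_rate` and the rank list are hypothetical and do not come from the study, and the choice of denominator (all cases versus only cases where the variant was found) is an assumption that changes the reported percentage.

```python
def top_k_rate(ranks, k):
    """Fraction of cases whose causal variant is ranked within the top k.

    ranks: 1-based rank of the causal variant per case, or None if the
    variant was not reported at all. The denominator here is all cases.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for eight cases, for illustration only.
ranks = [1, 1, 3, 15, 42, None, 1, 7]
print(top_k_rate(ranks, 1))   # share ranked first
print(top_k_rate(ranks, 20))  # share ranked within the top 20
```

To report rates relative to identified variants only, drop the `None` entries before calling the function.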

7

VSPGx: A High-Accuracy Pharmacogenomics Interpretation Software Solution with Automated CPIC Guideline Integration

Fortier, N.; Rudy, G.; Scherer, A.

2025-11-26 bioinformatics 10.1101/2025.11.24.690276 bioRxiv

Top 0.1%

44.9%


Accurate pharmacogenomic genotype determination and interpretation are essential for personalized medicine, yet existing bioinformatics tools face significant limitations in detecting named alleles, maintaining current allele definitions, and providing comprehensive clinical annotations. We present VSPGx, a pharmacogenomics interpretation software solution that identifies diplotypes from next-generation sequencing data and annotates them against Clinical Pharmacogenetics Implementation Consortium (CPIC) and FDA drug recommendations using automated curation of the latest allele definitions. We benchmarked VSPGx against established tools including Aldy, PharmCAT, and Stargazer using both synthetic datasets and real-world clinical samples. In a comprehensive synthetic benchmark spanning 3,655 CYP2C9 diplotype combinations, VSPGx achieved 99.97% concordance, matching PharmCAT's performance and substantially outperforming Aldy (93.08%) and Stargazer (27.06%). Clinical validation using 11 TaqMan OpenArray samples demonstrated 88.2% allele concordance and 89.1% phenotype concordance across 110 gene-sample combinations, with all discrepancies attributed to the benchmark data utilizing outdated allele definitions rather than VSPGx errors. Our automated curation process ensures continuous alignment with current CPIC guidelines, addressing a critical gap in existing pharmacogenomic analysis tools. VSPGx provides a robust, clinically validated solution for pharmacogenomic analysis that combines high-accuracy diplotype calling with up-to-date, evidence-based drug recommendations.
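
A synthetic diplotype benchmark like the one described can be pictured as enumerating unordered allele pairs and scoring the calls against the expected diplotypes. The sketch below is a minimal illustration, not the VSPGx benchmark: the three-allele panel, the sample names, and the helper functions are hypothetical. As an arithmetic aside, 85 alleles would yield exactly 3,655 unordered pairs; the abstract does not say how its 3,655 combinations were built, so this is an observation and not a claim about the study design.

```python
from itertools import combinations_with_replacement
from math import comb


def enumerate_diplotypes(alleles):
    """List every unordered allele pair, including homozygous pairs."""
    return [f"{a}/{b}" for a, b in combinations_with_replacement(alleles, 2)]


def diplotype_count(n_alleles):
    """Number of unordered pairs with repetition: n * (n + 1) / 2."""
    return comb(n_alleles + 1, 2)


def concordance(calls, truth):
    """Fraction of samples whose called diplotype equals the expected one."""
    matches = sum(1 for sample, expected in truth.items()
                  if calls.get(sample) == expected)
    return matches / len(truth)


# Hypothetical three-allele panel, for illustration only.
diplotypes = enumerate_diplotypes(["*1", "*2", "*3"])
print(diplotypes)
print(diplotype_count(85))

# Simulated benchmark: every call is correct except one deliberate error.
truth = {f"s{i}": d for i, d in enumerate(diplotypes)}
calls = dict(truth)
calls["s5"] = "*1/*3"
print(concordance(calls, truth))
```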

8

GA4GH Phenopacket-Driven Characterization of Genotype-Phenotype Correlations in Mendelian Disorders

Rekerle, L.; Danis, D.; Rehburg, F.; Graefe, A. S.; Bily, V.; Caballero-Oteyza, A.; Cacheiro, P.; Chimirri, L.; Chong, J. X.; Connelly, E.; de Vries, B. B.; Dingemans, A. J.; Duyzend, M. H.; Freiberger, T.; Gehle, P.; Groza, T.; Hansen, P.; Jacobsen, J.; Klocperk, A.; Ladewig, M. S.; Love, M. I.; Marcello, A. J.; Mordhorst, A.; Munoz-Torres, M. C.; Reese, J.; Schuetz, C.; Smedley, D.; Strauss, T.; Vladyka, O.; Zocche, D.; Thun, S.; Mungall, C. J.; Haendel, M. A.; Robinson, P. N.

2025-03-06 genetic and genomic medicine 10.1101/2025.03.05.25323315 medRxiv

Top 0.1%

44.0%


Comprehensively characterizing genotype-phenotype correlations (GPCs) in Mendelian disease would create new opportunities for improving clinical management and understanding disease biology. However, heterogeneous approaches to data sharing, reuse, and analysis have hindered progress in the field. We developed Genotype Phenotype Evaluation of Statistical Association (GPSEA), a software package that leverages the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema to represent case-level clinical and genetic data about individuals. GPSEA applies an independent filtering strategy to boost statistical power to detect categorical GPCs represented by Human Phenotype Ontology terms. GPSEA additionally enables visualization and analysis of continuous phenotypes, clinical severity scores, and survival data such as age of onset of disease or clinical manifestations. We applied GPSEA to 85 cohorts with 6613 previously published individuals with variants in one of 80 genes associated with 122 Mendelian diseases and identified 225 significant GPCs, with 48 cohorts having at least one statistically significant GPC. These results highlight the power of standardized representations of clinical data for scalable discovery of GPCs in Mendelian disease.
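
The idea behind independent filtering is to discard hypotheses that cannot plausibly reach significance before multiple-testing correction, using a criterion that does not depend on the p-value, so that fewer tests share the correction burden. The sketch below illustrates this with a Benjamini-Hochberg adjustment. It is not GPSEA's API or its actual filter: the function names, the placeholder term labels, the minimum-count criterion, and the p-values are all hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_then_correct(tests, min_count):
    """tests: list of (term, n_annotated, p_value).

    Drop terms annotated in fewer than min_count individuals (a criterion
    that ignores the p-value), then adjust the remaining p-values.
    """
    kept = [t for t in tests if t[1] >= min_count]
    adjusted = benjamini_hochberg([t[2] for t in kept])
    return {t[0]: q for t, q in zip(kept, adjusted)}


# Hypothetical phenotype terms, counts, and p-values, for illustration only.
tests = [
    ("term_A", 12, 0.004),
    ("term_B", 2, 0.03),
    ("term_C", 9, 0.02),
    ("term_D", 1, 0.5),
    ("term_E", 15, 0.6),
]
print(filter_then_correct(tests, min_count=5))  # three tests share the correction
print(filter_then_correct(tests, min_count=0))  # all five tests share it
```

With the filter, `term_A` is adjusted over three tests instead of five, which is where the gain in power comes from.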

9

Single-nucleus multiomic landscape of congenital heart diseases reveals disease-specific genotypic profiles

Lukovic, D.; Gyongyosi, M.; White, B.; Han, E.; Hasimbegovic, E.; Müller-Zlabinger, K.; Gynter, A.; Mancikova, V.; Pavo, I. J.; Michelitsch, M.; Michel-Behnke, I.

2025-08-05 pediatrics 10.1101/2025.08.01.25332780 medRxiv

Top 0.1%

43.5%


Tetralogy of Fallot (TOF) is the most common cyanotic congenital heart defect (CHD), whereas hypoplastic left heart syndrome (HLHS) represents 2-3% of all CHDs. We analyzed the single-nucleus multiome profile of right ventricle samples obtained from children during routine cardiac surgery for correction of TOF or staged surgical palliation for HLHS to define cell types and characterize gene regulation states in different cell clusters. Data were integrated with pre-existing controls and analyzed using Scanpy to identify clusters, which were annotated using an automated tool and manual curation; pycisTopic to identify cell states and cis-regulatory topics; and SCENIC+ to infer enhancer-driven gene regulatory networks (eRegulons). Integrated RNA-seq analysis identified 22 different cell subtypes, including five cardiomyocyte phenotypes. TOF samples showed involvement in pathway networks of the cell cycle, DNA repair, DNA replication, and RNA metabolism, whereas gene expression in HLHS samples was related to extracellular matrix organization, anatomical structure development, cell adhesion, actin-myosin filament sliding, and contractile muscle fiber pathways. In addition, the gene expression fingerprints of endothelial, fibroblast, pericyte, immune, and neuronal cell nuclei from TOF and HLHS samples exhibited nuclei-specific significant de-regulation compared to controls. We found considerable heterogeneity among the transcriptomes of TOF and HLHS, explaining the diverse clinical phenotypes. These findings can enable the development of new gene-based interventions for specific CHDs.

10

MaveMD: A functional data resource for genomic medicine

McEwen, A. E.; Stone, J.; Tejura, M.; Gupta, P.; Capodanno, B. J.; Da, E. Y.; Grindstaff, S. B.; Moore, N.; Snyder, A. E.; Stergachis, A. B.; Starita, L. M.; Fowler, D. M.; Rubin, A. F.

2025-11-19 genetic and genomic medicine 10.1101/2025.11.15.25336228 medRxiv

Top 0.1%

41.0%


Variants of uncertain significance (VUS) undermine genetic medicine implementation because they have an unknown relationship to disease and cannot be used for clinical decision-making. While evidence from multiplexed assays of variant effect (MAVEs) can help resolve VUS, major barriers prevent routine clinical use, including data fragmentation and assay calibration. To address these challenges, we present MaveMD (MAVEs for MeDicine), a new interface for the MaveDB database that displays clinical evidence calibrations, provides intuitive visualizations, integrates with ClinVar and ClinGen, and exports clinical evidence compatible with ACMG/AMP guidelines. MaveMD currently contains 476,076 variant effect measurements curated from 82 MAVE datasets spanning 39 disease-associated genes, enabling classification of 75% of ClinVar VUS and 62% of future variants in these genes. MaveMD is designed to support and facilitate future data generation efforts and the use of MAVE evidence in clinical practice, thereby reducing the VUS burden and improving genetic medicine outcomes.

11

deCYPher: Star Allele-Resolution Computational Framework of Pharmacogenes for Haplotype-Resolved Long-Read Assemblies

Chang, T.-Y.; Liu, Y.-S.; Lai, H.-S.; Hung, T.-K.; Lin, H.-F.; Lin, Y.-H.; Hsu, C.-L.; Yang, Y.-C.; Chen, C.-Y.; Chen, P.-L.; Hsu, J. S.

2025-11-03 bioinformatics 10.1101/2025.10.13.681303 bioRxiv

Top 0.1%

40.9%


Although existing next-generation sequencing (NGS) tools, such as Aldy and Cyrius, have been applied for allele typing, they cannot achieve complete accuracy due to various genomic challenges including pseudogenes, structural variations, hybrid genes, copy number variations, and gene deletions. These complexities make accurate pharmacogene interpretation more challenging, despite the crucial role pharmacogenomics plays in precision medicine. We developed deCYPher, a tool that generates personalized pharmacogenomic reports from haplotype-resolved assemblies. The tool enables analysis of all PharmVar 1A level genes, such as CYP2B6, CYP2C9, CYP2C19, CYP2D6, CYP3A5, CYP4F2, DPYD, NUDT15, and SLCO1B1. Applied to all HPRC haplotypes (including both release 1 and release 2 data), deCYPher demonstrated high accuracy in resolving complex gene structures. In the case of CYP2D6, release 1 identified 6% gene multiplications, 6% full gene deletions, and 4% CYP2D6/CYP2D7 hybrids. By contrast, release 2 demonstrated an increased prevalence of multiplications (14%) and hybrids (11%), while the frequency of full gene deletions remained comparable at 5%. Comparison with pb-StarPhase revealed discrepancies in 12 of 94 assemblies in the release 1 dataset. For instance, in sample HG02257, Aldy, Cyrius, and deCYPher consistently identified the genotype as *2/*35, whereas pb-StarPhase reported *2/*2. Notably, the *35-defining variants were present in the BAM and VCF files in the pb-StarPhase pipeline, but the local read depth over the *35-specific region was only 5x in HG02257-p, suggesting that the misclassification likely resulted from insufficient coverage - a known limitation of pb-StarPhase under low-depth conditions.

12

PhenoScore: AI-based phenomics to quantify rare disease and genetic variation

Dingemans, A. J. M.; Hinne, M.; Truijen, K. M. G.; Goltstein, L.; van Reeuwijk, J.; de Leeuw, N.; Schuurs-Hoeijmakers, J.; Pfundt, R.; Diets, I. J. M.; den Hoed, J.; de Boer, E.; Coenen-van der Spek, J.; Jansen, S.; van Bon, B. W.; Jonis, N.; Ockeloen, C.; Vulto-van Silfhout, A. T.; Kleefstra, T.; Koolen, D. A.; Campeau, P. M.; Palmer, E. E.; Van Esch, H.; Lyon, G. J.; Alkuraya, F. S.; Rauch, A.; Marom, R.; Baralle, D.; van der Sluijs, P. J.; Santen, G. W. E.; Kooy, R. F.; van Gerven, M. A. J.; Vissers, L. E. L. M.; de Vries, B. B. A.

2022-10-26 genetic and genomic medicine 10.1101/2022.10.24.22281480 medRxiv

Top 0.1%

40.9%


While both molecular and phenotypic data are essential when interpreting genetic variants, prediction scores (CADD, PolyPhen, and SIFT) have focused on molecular details to evaluate pathogenicity, omitting phenotypic features. To unlock the full potential of phenotypic data, we developed PhenoScore: an open source, artificial intelligence-based phenomics framework. PhenoScore combines facial recognition technology with Human Phenotype Ontology (HPO) data analysis to quantify phenotypic similarity at the level of both individual patients and cohorts. We prove PhenoScore's ability to recognize distinct phenotypic entities by establishing recognizable phenotypes for 25 out of 26 investigated genetic syndromes against clinical features observed in individuals with other neurodevelopmental disorders. Moreover, PhenoScore was able to provide objective clinical evidence for two distinct ADNP-related phenotypes that had already been established functionally, but not yet phenotypically. Hence, PhenoScore will not only be of use to unbiasedly quantify phenotypes to assist genomic variant interpretation at the individual level, such as for reclassifying variants of unknown clinical significance, but is also of importance for detailed genotype-phenotype studies.

13

aiDIVA - Diagnostics of Rare Genetic Diseases Using Large Language Models

Boceck, D.; Laugwitz, L.; Sturm, M.; Bezdan, D.; Gschwind, A.; Haack, T. B.; Ossowski, S.

2025-09-07 genetic and genomic medicine 10.1101/2025.09.04.25335099 medRxiv

Top 0.1%

40.7%


Genome sequencing (GS) enables the accurate identification of genetic variants in most genomic regions and is rapidly transforming routine diagnostics for rare diseases (RD). While streamlined data generation is scalable, efficient prioritization and correct clinical interpretation of detected alterations remain a challenge, often requiring manual classification by experts with years of training. Hence, there is a need for AI-driven clinical decision support systems that assist clinical experts in identifying causal variants or, in case of large-scale re-analysis of unsolved cases, fully automate the process. To this end, many tools have been developed to estimate the impact of variants on protein function. However, only a small number of tools combine genomic data, variant annotations, and phenotypic data to diagnose cases. Here we introduce aiDIVA, an ensemble-AI featuring a hierarchically organized set of statistical and machine learning models trained on genomic and phenotypic data to identify the causal variant(s) among tens of thousands of genetic variants of a patient. aiDIVA generates pathogenicity classifications for each variant using a random forest AI model and an evidence-based score for dominant and recessive diseases. It combines these predictions with additional clinical metadata to prioritize and rank the most likely causal variants. aiDIVA uses large language models (LLMs) to further improve and explain the results. Finally, the aiDIVA-meta model combines all scores to generate a ranked list of variants. In a benchmark analysis on more than 3,000 diagnostically solved RD patients, the causal variant was included in 97% of the cases among the top-3 candidate variants reported by aiDIVA-meta. Unlike comparative methods, aiDIVA provides interpretable explanations for the best candidates.

14

An artificial intelligence-based model for prediction of Clonal Hematopoiesis mutants in cell-free DNA samples

Arango-Argoty, G.; Haghighi, M.; Sun, G. J.; Markovets, A.; Barrett, J. C.; Lai, Z.; Jacob, E.

2024-12-16 bioinformatics 10.1101/2024.12.11.627785 bioRxiv

Top 0.1%

40.6%


Circulating tumor DNA is a critical biomarker in cancer diagnostics, but its accurate interpretation requires careful consideration of clonal hematopoiesis (CH), which can contribute to variants in cell-free DNA and potentially obscure true tumor-derived signals. Accurate detection of somatic variants of CH origin in plasma samples remains challenging in the absence of matched white blood cells sequencing. Here we present an open-source machine learning framework (MetaCHIP) which classifies variants in cfDNA from plasma-only samples as CH or tumor origin, surpassing state-of-the-art classification rates.

15

A Lung Cancer Mouse Model Database

Cai, L.; Gao, Y.; DeBerardinis, R. J.; Chen, T.; Winslow, M. M.; Xiao, G.; Rudin, C.; Oliver, T. G.; Minna, J. D.; Xie, Y.

2024-02-29 cancer biology 10.1101/2024.02.28.582577 bioRxiv

Top 0.1%

40.1%


Lung cancer, the leading cause of cancer mortality, exhibits diverse histological subtypes and genetic complexities. Numerous preclinical mouse models have been developed to study lung cancer, but data from these models are disparate, siloed, and difficult to compare in a centralized fashion. Here we established the Lung Cancer Mouse Model Database (LCMMDB), an extensive repository of 1,354 samples from 77 transcriptomic datasets covering 974 samples from genetically engineered mouse models (GEMMs), 368 samples from carcinogen-induced models, and 12 samples from a spontaneous model. Meticulous curation and collaboration with data depositors have produced a robust and comprehensive database, enhancing the fidelity of the genetic landscape it depicts. The LCMMDB aligns 859 tumors from GEMMs with human lung cancer mutations, enabling comparative analysis and revealing a pressing need to broaden the diversity of genetic aberrations modeled in GEMMs. Accompanying this resource, we developed a web application that offers researchers intuitive tools for in-depth gene expression analysis. With standardized reprocessing of gene expression data, the LCMMDB serves as a powerful platform for cross-study comparison and lays the groundwork for future research, aiming to bridge the gap between mouse models and human lung cancer for improved translational relevance.

16

Characterizing SCN1A-Related Disorders Using Real-World Data Across 681 Patient-Years

Prentice, A. J.; McSalley, I.; Magielski, J. H.; Mercurio, J.; Tefft, S.; Winters, A.; Kaufman, M. C.; Ruggiero, S. M.; McGarry, L. M.; Hood, V.; McKee, J. L.; Goldberg, E. M.; Helbig, I.

2026-03-02 genetic and genomic medicine 10.64898/2026.02.24.26346493 medRxiv

Top 0.1%

40.0%


SCN1A-related disorders are the single most common monogenic cause of epilepsy and represent a major focus of precision medicine efforts. In conjunction with existing prospective studies, the analysis of real-world data obtained during routine clinical care can expand upon the scale and duration of available data and contribute to the development of meaningful outcomes for clinical trials. Here, we leveraged real-world data to delineate the longitudinal disease history of 100 individuals with SCN1A-related disorders using a systematic approach. We mapped a total of 671 unique clinical terms to a standardized framework in monthly increments across 681 patient-years, including 75 terms related to seizure types. Within this cohort, 89 individuals had presumed loss-of-function variants in SCN1A based on variant type and clinical diagnosis, including those with Dravet syndrome (N = 79) and genetic epilepsy with febrile seizures plus (N = 10). Ten individuals had a non-Dravet developmental and epileptic encephalopathy caused by gain-of-function variants in SCN1A. By annotating seizure type and frequency in monthly time-bins, we assessed seizure burden. A median of 17 changes in seizure frequency and ten terms referring to seizure type were identified per participant. Myoclonic seizures occurred with high frequency (median >5 daily), whereas hemiclonic, focal impaired consciousness, and bilateral tonic-clonic seizures occurred more rarely (median monthly). Retrospective analysis of developmental histories showed a range of cognitive abilities. Neurodevelopmental differences were observed in 83% (83/100) of individuals, of whom 83% (69/83) demonstrated delayed language skills. Motor coordination impairments, including gait disturbance, ataxia, hypotonia, and imbalance were annotated in 69% (69/100) of participants. EEG findings varied with age; most were reported as normal before nine months of age, after which the prevalence of abnormal interictal findings increased. 
Individuals with different clinical syndromes had unique medication landscapes, with 554 prescriptions of 37 unique therapies. Changes in treatment coincided with the diagnosis of an SCN1A-related disorder, with an increase in cannabidiol, clobazam, and fenfluramine and reduction in sodium channel-blocker use following genetic diagnosis. In summary, we reconstructed the longitudinal disease history of SCN1A-related disorders from electronic medical records using a standardized framework for the analysis of real-world clinical data. We refine existing natural history data of SCN1A-related disorders by providing a granular landscape of seizures, comorbidities, and treatment approaches over time.

17

Clinical Evaluation of an AI System for Streamlined Variant Interpretation in Genetic Testing

Ruzicka, J.; Ravel, J.-M.; Audoux, J.; Boulat, A.; Thevenon, J.; Yauy, K.; Dancer, M.; Raymond, L.; Lombardi, Y.; Philippe, N.; Blum, M. G.; Duforet-Frebourg, N.; Mesnard, L.

2025-02-05 genetic and genomic medicine 10.1101/2025.02.04.25321641 medRxiv

Top 0.1%

39.9%


The growing use of genomic sequencing to diagnose hereditary diseases has increased the interpretive workload for clinical laboratories. Efficient methods are needed to maximize diagnostic yield without overwhelming resources. We developed DiagAI, an integrative machine-learning system trained to prioritize and sort causal variants in rare diseases. DiagAI integrates Universal Pathogenicity Predictor (UP2), a machine-learning model trained to predict ACMG pathogenicity classes; PhenoGenius, to match genotype-phenotype interactions; and expert features such as inheritance and variant quality. We retrospectively analyzed 196 diagnosed exomes from a nephrology cohort. To benchmark UP2's performance, we evaluated the ranking of 62 causal missense variants. UP2 ranked variants most effectively beyond shortlist sizes of 10 and identified pathogenic variants missed by AlphaMissense. DiagAI identified 94.9% of causal variants in diagnostic exomes with HPO terms, compared to 90.8% without, with median shortlist sizes of 12 and 9 variants, respectively. With HPO terms, 74% of top-ranked variants were diagnostic, versus 42% without, outperforming Exomiser and AI-MARRVEL. DiagAI produces accurate shortlists that streamline variant interpretation, offering a scalable solution for growing diagnostic volumes.

18

HiFi sequencing accurately identifies clinically relevant variants in paralogous genes

van der Sanden, B.; Betz, C.; Herzog, K.; Schamschula, E.; Wimmer, K.; Vater, I.; Balachandran, S.; Chen, X.; Corominas Galbany, J.; Timmermans, R.; Derks, R.; HiFi Solves EMEA Consortium, ; Spielmann, M.; Eberle, M. A.; Gilissen, C.; Vissers, L. E. L. M.; Zschocke, J.; Bolz, H. J.; Hoischen, A.

2025-10-31 genetic and genomic medicine 10.1101/2025.10.29.25339045 medRxiv

Top 0.1%

39.8%

Short-read sequencing (SRS) methods have improved the detection of small genetic variants but remain limited in highly homologous genomic regions, such as segmental duplications with gene-pseudogene pairs. These paralogous regions often require complex, locus-specific assays for accurate analysis. Long-read genome sequencing (lrGS) technologies, such as PacBio HiFi sequencing, can span these regions but still face challenges in variant calling due to alignment ambiguities. Here, we evaluated PacBio HiFi lrGS combined with Paraphase, a dedicated haplotype-based variant caller, in 86 individuals with 125 known clinically relevant variants across 11 paralogous loci. Standard HiFi variant callers detected 95/125 variants, while the remaining 30 variants were identified only by Paraphase. Together, the standard variant callers and Paraphase detected all known variants, including SNVs, InDels, CNVs, SVs, and gene conversions. In addition, lrGS allowed accurate phasing and gene-pseudogene copy number detection. We demonstrate that PacBio HiFi lrGS, particularly when integrated with Paraphase, enables comprehensive variant detection in previously difficult-to-assess genomic regions. These results also suggest that lrGS is ready for wider implementation, possibly as a first-tier diagnostic approach for individuals with suspected variants in these paralogous regions.

19

Pangenome-based identification of cryptic pathogenic variants in undiagnosed rare disease patients

Jang, S. S.; Kim, S.; Lee, S.; Kim, S. Y.; Moon, J.; Kim, J.; Chae, J.-H.

2025-07-11 genetic and genomic medicine 10.1101/2025.07.08.25330875 medRxiv

Top 0.1%

39.8%

Background: Despite widespread implementation of exome and genome sequencing, a substantial proportion of rare disease patients remain undiagnosed due to inherent limitations in detecting structural, repetitive, and regulatory variants. Methods: We applied long-read sequencing (LRS) to 40 individuals from 33 previously undiagnosed Korean families. De novo assemblies were integrated into a graph-based pangenome workflow, enabling sensitive detection of single-nucleotide, structural, and tandem-repeat variants and direct profiling of CpG methylation. Results: Pathogenic or likely pathogenic variants were identified in 9 (27.3%) families that had remained unsolved despite prior short-read sequencing. The discoveries comprised deep intronic splice-altering SNVs, non-coding regulatory deletions, complex rearrangements, large deletions, tandem repeat expansions, and aberrant methylation profiles. We also implicated CXXC1 as a novel disease-associated gene, potentially contributing to a global DNA methylation defect, and revealed novel pathogenic variants in established disease genes such as HEXB and NGLY1, providing insights into underrecognized genetic contributors to rare diseases. Conclusions: LRS coupled with pangenome-based, graph-driven analysis closed a sizable diagnostic gap, broadened the mutational spectra of several Mendelian genes, and brought epigenomic evidence into rare disease investigation. These findings support the adoption of long-read, graph-based workflows as a front-line strategy for comprehensive genomic and epigenomic diagnosis.

20

Identification and validation of liquid biopsy-based methylation biomarkers: a germ cell tumor subtype-specific study

Janssen, F. W.; Gillis, A. J. M.; Kester, L. A.; Takami, H.; Ichimura, K.; Eleveld, T. F.; Looijenga, L. H. J.

2025-11-05 cancer biology 10.1101/2025.11.04.686501 medRxiv

Top 0.1%

39.7%

Human germ cell tumors (GCTs) occur in infants, children, and adults, and present as germinomatous and/or non-germinomatous (embryonal carcinoma, teratoma, yolk sac tumor (YST), and choriocarcinoma) histologies at gonadal or extragonadal locations. Accurate subtyping is crucial for prognosis and treatment, but current clinical biomarkers lack sensitivity and specificity (serum proteins) or require a tissue biopsy (immunohistochemistry). Hence, less-invasive and improved subtype-specific biomarkers have potential for clinical utility. We conducted a meta-analysis of DNA methylation data (450K/EPIC array) from 15 (three original and 12 published) datasets including 713 GCTs, 109 healthy testis, and 221 healthy peripheral blood samples, revealing that GCT histology is the primary driver of methylation profiles, regardless of tumor location and patients' age or sex. Per subtype, we identified differentially methylated regions as potential biomarkers. As proof of concept, we identified and validated two YST-specific biomarkers, i.e., APC and DPP7 promoter methylation, using methylation-sensitive restriction enzyme-based qPCR, of which DPP7 was also detectable in GCT serum-derived cell-free DNA. In conclusion, we present a novel method for in silico identification and in vitro and in vivo validation of YST subtype-specific liquid biopsy-based biomarkers. Our bioinformatic pipeline is easily transferable, encouraging additional applications in pan(pediatric)-cancer studies beyond GCTs.